Overall Question

  1. Does Smoking Lead to Babies Born Prematurely?
  2. Does Smoking lead to Babies with Low Birth Weight?

Why it is interesting?

Generally speaking, smoking is harmful to the body. Yet many people still smoke, and the tendency for women to smoke during pregnancy is increasing, which may have negative effects on the next generation. Babies are central to a family and important for our future. To figure out what effect smoking has on both mothers and babies, we analyzed whether smoking leads to babies being born prematurely and whether smoking leads to babies with low birth weight. These results give us insight into smoking’s effects on babies and mothers.

What new tools did we apply?

We applied modeling tools, loops, and dplyr tools. These tools helped us dive much deeper into the dataset than we originally could, when we only had ggplot at our disposal.

The new answers compared to previous lab

With the new tools at our disposal, we were able to draw firmer conclusions about the dataset. Before, some of the results were ambiguous; now they are much better supported.

summary(lm(Premature ~ smoke, babies))
## 
## Call:
## lm(formula = Premature ~ smoke, data = babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2146 -0.2146 -0.1691 -0.1691  0.8309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.16914    0.01692   9.996   <2e-16 ***
## smoke        0.04544    0.02464   1.844   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3925 on 1016 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.003335,   Adjusted R-squared:  0.002354 
## F-statistic:   3.4 on 1 and 1016 DF,  p-value: 0.06549

A linear model of premature birth on smoking shows that the smoke coefficient is not significant at the 5% level (p = 0.0655). We acknowledge that this model is not the best way to determine significance: both variables are categorical, so fitting a straight line between them does not, intuitively, make much sense.
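For a yes/no outcome such as Premature, a logistic regression is the more usual choice. The following is a minimal sketch on simulated data, not the real babies data: the sample size and the two premature rates are assumptions chosen only to resemble the fit above, and the names sim and logit_fit are hypothetical.

```r
# Simulated stand-in for the babies data (hypothetical values, not the real dataset).
set.seed(1)
n <- 1000
sim <- data.frame(smoke = rbinom(n, 1, 0.47))
# Assumed premature rates: 17% for non-smokers, 21% for smokers.
sim$Premature <- rbinom(n, 1, ifelse(sim$smoke == 1, 0.21, 0.17))

# Logistic regression: models the log-odds of a premature birth.
logit_fit <- glm(Premature ~ smoke, data = sim, family = binomial)
summary(logit_fit)$coefficients

# exp(coefficient) is the odds ratio of prematurity, smokers vs. non-smokers.
odds_ratio <- exp(coef(logit_fit)[["smoke"]])
odds_ratio

# With a single 0/1 predictor, the fitted probabilities equal the group proportions.
fitted_p <- predict(logit_fit, data.frame(smoke = c(0, 1)), type = "response")
group_p  <- tapply(sim$Premature, sim$smoke, mean)
```

The p-value for smoke in this kind of model answers the same question as the linear model, but the fitted values stay between 0 and 1 and the coefficient has a direct odds-ratio interpretation.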

summary(lm(bwtoz ~ smoke, babies))
## 
## Call:
## lm(formula = bwtoz ~ smoke, data = babies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.78 -11.11   0.89  10.89  53.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122.7776     0.7538  162.87  < 2e-16 ***
## smoke        -8.6681     1.0986   -7.89 7.72e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.58 on 1026 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.0572, Adjusted R-squared:  0.05628 
## F-statistic: 62.25 on 1 and 1026 DF,  p-value: 7.718e-15

We now consider question two with the summary of the linear model above. It shows that smoking is a significant predictor of birthweight: babies of smokers are on average about 8.7 ounces lighter (p < 0.001). This contrasts with our first linear model, where smoking was not a significant predictor of prematurity.

Katie’s and Chris’s individual questions give a more rigorous answer to these two questions.
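With a single 0/1 predictor, the slope of a linear model is the difference between the two group means, and its t-test matches a pooled two-sample t-test. The sketch below illustrates this on simulated data; the means, spread, and sample size are assumptions that only loosely mimic the summary above.

```r
# Simulated birth weights (hypothetical values, not the real dataset).
set.seed(2)
n <- 1000
sim <- data.frame(smoke = rbinom(n, 1, 0.47))
sim$bwtoz <- rnorm(n, mean = 123 - 9 * sim$smoke, sd = 17.5)

fit <- lm(bwtoz ~ smoke, data = sim)

# Slope of the 0/1 predictor versus the raw difference in group means.
means     <- tapply(sim$bwtoz, sim$smoke, mean)
slope     <- coef(fit)[["smoke"]]
mean_diff <- unname(means["1"] - means["0"])

# Pooled-variance t-test on the same data.
tt    <- t.test(bwtoz ~ smoke, data = sim, var.equal = TRUE)
lm_t  <- summary(fit)$coefficients["smoke", "t value"]
c(slope = slope, mean_diff = mean_diff)
c(lm_t = abs(lm_t), ttest_t = abs(unname(tt$statistic)))
```

Reading the coefficient as a difference in means makes the result easier to state in plain terms: it is the estimated gap in average birth weight between the two groups.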

Conclusion

By combining the results from the first time we studied this lab with these new techniques, we can conclude that smoking during pregnancy does impact a baby’s birthweight. It is hard to see in the answer to Q1, but the linear model answering Q2 tells us that smoking is a significant indicator of birthweight.

Individual Findings

Lauren

Subquestion

Does the mother’s recorded pregnant weight have a relationship with the baby’s recorded weight? If so, how strong is this relationship?

Why it is important?

The main goal of a pregnancy is to have a healthy baby. If a mother knew how her weight impacted her child’s, she might take action to change that weight. On the other hand, a pregnant woman is only expected to gain an additional 20-30 pounds. I am interested in seeing if the baby’s weight reaches a certain max regardless of the mother’s weight.

New tool I used

I created a linear model to visualize the relationship between the mother’s weight and the baby’s weight.

ggplot(babies, aes(x=mpregwt, y=bwtoz)) + 
  geom_point() +
  geom_smooth(method = 'lm')
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

First, I wanted to visualize the data to see if it is reasonable to create a linear model for this relationship. There seems to be somewhat of a relationship, but with a lot of variance. After seeing this plot, I think it makes sense to look further into a linear model.

model <- lm(bwtoz~mpregwt, babies)
summary(model)
## 
## Call:
## lm(formula = bwtoz ~ mpregwt, data = babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.421 -11.108   0.385  10.834  57.094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.35103    3.47275  28.609  < 2e-16 ***
## mpregwt      0.15050    0.02666   5.645 2.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18 on 1004 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.03076,    Adjusted R-squared:  0.02979 
## F-statistic: 31.86 on 1 and 1004 DF,  p-value: 2.154e-08

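The model shows us that the mother’s pregnant weight is significant in predicting the birthweight of the baby. However, the slope of the relationship is very small at 0.1505, about 0.15 ounces of birthweight per pound of the mother’s weight, and the model explains only about 3% of the variance (R-squared of 0.03076). Next, we can consider the residuals to check for any odd patterns.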

plot(model)

Considering the residuals versus the fitted values, there does not appear to be a significant pattern in the residuals. They appear to be decently random. We can double check this thought using the QQ-plot. The standardized residuals do present tails at either end of the plot. This tells us that a linear model may not be the best way to represent this data.

We could consider other ways to model this data with a spline or a generalized linear model, but this might not make sense for our data. If the baby’s weight reaches a max or a plateau regardless of the mother’s weight, this would be seen in a natural spline. We can consider a spline-like fit using ggplot shown below.

ggplot(babies, aes(x=mpregwt, y=bwtoz)) + 
  geom_point() +
  geom_smooth(method = 'loess')
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

Using this fit, there is an obvious linear relationship when a mother is less than 140 pounds. When the mother is heavier than this, it does not seem to impact the baby’s weight as much. As anticipated, the baby’s weight seems to reach a max.
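The plateau idea can also be checked with an explicit natural spline instead of a loess smoother. This is a minimal sketch on simulated data: the flattening at 140 pounds, the noise level, and the choice of df = 3 are all assumptions made for illustration, and ns() comes from the splines package that ships with R.

```r
library(splines)  # part of the standard R distribution

# Simulated data (hypothetical): weight rises up to 140 lb, then flattens.
set.seed(3)
n <- 800
mpregwt <- runif(n, 90, 220)
bwtoz   <- 80 + 0.3 * pmin(mpregwt, 140) + rnorm(n, sd = 17)
sim     <- data.frame(mpregwt, bwtoz)

lin_fit <- lm(bwtoz ~ mpregwt, data = sim)
ns_fit  <- lm(bwtoz ~ ns(mpregwt, df = 3), data = sim)

# F-test: does the spline explain significantly more than the straight line?
anova(lin_fit, ns_fit)

# Predicted birth weight at a few mother's weights, to see where the curve flattens.
plateau_pred <- predict(ns_fit, data.frame(mpregwt = c(100, 130, 160, 200)))
plateau_pred
```

If the F-test favors the spline and the predictions level off at higher weights, that supports a plateau more formally than reading it off the smoothed plot.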

Let us consider a linear model on just that first part of the data where the mother’s pregnant weight is less than 140 pounds.

skinnyMoms <- babies %>% filter(mpregwt < 140)

skinnyModel <- lm(bwtoz~mpregwt, skinnyMoms)
summary(skinnyModel)
## 
## Call:
## lm(formula = bwtoz ~ mpregwt, data = skinnyMoms)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.394 -10.752   0.419  10.865  58.208 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.04144    6.52931  12.412  < 2e-16 ***
## mpregwt      0.30682    0.05459   5.621 2.68e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.26 on 756 degrees of freedom
## Multiple R-squared:  0.04011,    Adjusted R-squared:  0.03884 
## F-statistic: 31.59 on 1 and 756 DF,  p-value: 2.678e-08
plot(skinnyModel)

The summary of the restricted model shows us that the slope of the relationship has increased to 0.30682 and that the mother’s pregnant weight is still significant in predicting the baby’s weight. The QQ-plot for the restricted model has less dramatic tails, which tells us that a linear model is more appropriate for this filtered data.

Now, we can consider the plot. I will compare a linear fit (blue) to a loess fit (green).

ggplot(skinnyMoms, aes(x=mpregwt, y=bwtoz)) + 
  geom_point() +
  geom_smooth(method = 'lm', color = "blue") +
  geom_smooth(method = "loess", color = "green")

As we can see, the linear fit for this reduced data set is a much better fit than for the entire data set.

It is clear that a mother’s pregnant weight does impact the baby’s birthweight. However, the baby’s weight tends to reach a sort of max regardless of the mother’s weight. Additionally, it is more beneficial to predict the baby’s weight using a linear model when the mother’s pregnant weight is less than 140 pounds.

Katie

Subquestion

Do fewer gestation days lead to lighter babies, and does this differ between smokers and nonsmokers?

Why it is important?

Short gestation may directly lead to lower birth weight, and I think this is an important part of answering our overall questions. Generally speaking, we all assume it causes low weight, but is there a real relationship between them? I want to explore it more deeply.

New tool I used

To find the relationship between gestation days and the baby’s weight, I built a new model: a linear relationship between log gestation days and log weight. Taking logs makes the data clearer and the trend easier to see. I then added predictions and residuals to see the difference between the model and the real data.

# start from the full babies data; babies1 is not defined before this point
babies1 <- babies %>% filter(gestation <= 270) %>% 
  mutate(lweight = log(bwtoz),lges= log(gestation)) %>% select(gestation,bwtoz,lweight,lges,number,smoke) 

ggplot(babies1,aes(lges,lweight))+
  geom_hex(bins = 50)+
  geom_smooth(method = "loess")+
  facet_wrap(~smoke)+
  ggtitle("Tendency of gestation days with weight")

model_babies1 <-lm(lweight~lges,data = babies1)

I use data_grid to create a grid of gestation days, add the model’s predicted weights to it, and plot these estimates together with the real data.

grid <- babies1 %>% 
  data_grid(gestation = seq_range(gestation,270))%>%
  # the model was fit on natural logs, so use log() here and exp() to back-transform
  mutate(lges = log(gestation))%>%
  add_predictions(model_babies1,"lweight")%>%
  mutate(bwtoz = exp(lweight))
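Predictions from a log-log model come out on the log scale and have to be back-transformed with the inverse of the same log that was used in the fit. The sketch below uses simulated data and base R only; the coefficients, noise level, and gestation range are assumptions, not estimates from the babies data.

```r
# Simulated gestation and birth weight (hypothetical values).
set.seed(4)
gestation <- sample(230:270, 300, replace = TRUE)
bwtoz     <- exp(-3 + 1.4 * log(gestation) + rnorm(300, sd = 0.15))
sim       <- data.frame(gestation, bwtoz)

# Model fit on natural logs of both variables.
fit <- lm(log(bwtoz) ~ log(gestation), data = sim)

# Prediction grid on the original gestation scale.
grid <- data.frame(gestation = seq(230, 270, by = 10))
grid$lweight <- predict(fit, newdata = grid)

# Natural log in, exp() out: predictions back in ounces.
grid$bwtoz <- exp(grid$lweight)

# Mixing bases (natural log in the fit, 2^ on the way back) gives a different curve.
grid$bwtoz_mixed <- 2^grid$lweight
grid
```

Comparing the bwtoz and bwtoz_mixed columns shows how far the fitted line moves when the bases do not match, which matters when judging whether points lie above or below the model.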


ggplot(babies1, aes(gestation,bwtoz )) + 
  geom_hex(bins = 50) + 
  geom_line(data = grid, colour = "red", size = 1)+
  facet_wrap(~smoke)+
  ggtitle("Linear model with real data")

We can see from the plot above that many of the babies’ weights lie above the fitted line in the nonsmoking group, while more points lie below it in the smoking group. The majority of points sit much higher relative to the line for nonsmokers than for smokers.

babies1 <- babies1 %>%
  add_residuals(model_babies1,"lreids")
ggplot(babies1,aes(lweight,lreids))+
  geom_hex(bins = 50)+
  geom_smooth(method = "loess")+
  ggtitle("residuals with lweight data")

ggplot(babies1, aes(as.character(gestation), lreids)) + geom_boxplot()+
  coord_flip()+
  facet_wrap(~smoke)+
  ggtitle("Smoke vs. Nonsmoke the residuals of gestation days")

Because there are many points above the linear model, I looked at the residuals for the weight. In the first graph, the residuals show a trend along with weight. After that, I plotted a box plot of the residuals against gestation days. The residuals for the nonsmoking group are larger than for the smoking group, which is what we expected. Thus the final answer is that short gestation affects the baby’s weight, especially for smoking mothers. This is an important part of our overall question, because smoking may affect gestation days, which can cause premature birth, and that is related to the baby’s weight.

Ryan

Subquestion

What is the difference between the income distributions of smokers vs. nonsmokers?

Why it is important?

This question is important because it gives some insight into whether a person’s income has any relationship with the likelihood that they will be a smoker. With this information, one could target specific income groups with information about the dangers of smoking, hopefully causing greater change.

New tool I used

I used a while loop to manipulate a certain variable in my data frame. I also used fitting to explore trends in the data more deeply. Here, I observed the qualities of the fit, and used it to draw a conclusion about my dataset. I then observed the residuals of the fit, and attempted to judge whether or not the fit was any good. This was a very old lab, so I also implemented some dplyr tools that were obviously useful here, which I am glad I know now, but wish I knew then. I also used the cor() function to get a preliminary idea of the relationship between income and smoking.

ggplot(subset(babies,!is.na(smoke)),mapping = aes(x=inc))+
  geom_bar(aes(y=..prop..,fill=..x..))+scale_fill_gradient(low="gray",high="yellow")+
  facet_wrap(~smoke)+labs(x = "Income",y="Proportion",title = "Income Distributions of Smokers vs. Non-Smokers")
## Warning: Removed 105 rows containing non-finite values (stat_count).

Shown here is the distribution of the proportion of people in each income group, for smokers and for non-smokers. It seems as if nonsmokers have a slightly greater density in the 4 lowest income groups than smokers. It also seems as if smokers have a greater density in the middle income groups (3-5). To explore this relationship further, I will manipulate the data so that I can show how the proportion of smokers differs across income groups.

count_inc<-babies%>%group_by(inc)%>%count()
count_inc<-count_inc$n
babies_psmoke<-babies%>%group_by(inc)%>%arrange(inc)%>%filter(!is.na(smoke))%>%filter(!is.na(inc))%>%summarise(psmoke = sum(smoke))

i <-1
while (i<=10) {
  babies_psmoke[i,2] = babies_psmoke[i,2]/count_inc[i]
  i<-i+1
}
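The proportion of smokers per income group can also be computed in one step as a group mean of the 0/1 smoke variable, which keeps the numerator and denominator on the same filtered rows. This is a sketch on simulated data in base R; the data frame sim, its missing-value rates, and the helper names are hypothetical, and the while loop is included only to show that both routes agree.

```r
# Simulated income groups (0-9) and smoking status with some missing values.
set.seed(5)
sim <- data.frame(
  inc   = sample(c(0:9, NA), 500, replace = TRUE),
  smoke = sample(c(0, 1, NA), 500, replace = TRUE, prob = c(0.5, 0.45, 0.05))
)

# Drop rows with a missing income group or smoking status first.
complete <- sim[!is.na(sim$inc) & !is.na(sim$smoke), ]

# Mean of a 0/1 variable within each group is the proportion of smokers.
psmoke     <- tapply(complete$smoke, complete$inc, mean)
psmoke_tbl <- data.frame(inc = as.numeric(names(psmoke)),
                         psmoke = as.vector(psmoke))

# Same calculation with a while loop, using counts from the same filtered rows.
loop_p <- numeric(nrow(psmoke_tbl))
i <- 1
while (i <= nrow(psmoke_tbl)) {
  rows      <- complete$inc == psmoke_tbl$inc[i]
  loop_p[i] <- sum(complete$smoke[rows]) / sum(rows)
  i <- i + 1
}
psmoke_tbl
```

Taking the counts after filtering matters: if the denominator still includes rows with a missing smoke value, the proportions come out slightly too small.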


ggplot(data = babies_psmoke, aes(x = inc,y = psmoke))+
 geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess'

cor(babies_psmoke$inc,babies_psmoke$psmoke)
## [1] 0.2126354

It seems like the proportion of those people who smoke is very weakly correlated with income. It also seems like there is a peak of smoking frequency for the “middle class” income groups.

With some preliminary information about the relationship between income group and proportion of smokers, I will try to fit a low order polynomial to the data, and see if any more information about the trends can be gathered.

model_psmoke <-lm(data = babies_psmoke,psmoke~poly(inc,2))

babies_psmoke<-babies_psmoke%>%add_predictions(model_psmoke)%>%add_residuals(model_psmoke)

ggplot(babies_psmoke,aes(x = inc))+
  geom_point(aes(y = psmoke))+
  geom_line(aes(y = pred),color = 'red')

The low degree polynomial fit (a degree 2 polynomial) seems to confirm the suspicion that middle income groups tend to be more likely to be smokers than either high or low income groups. It seems, however, that the model fits the data fairly poorly, so it would be useful to look at the residuals to gauge just how skeptical one should be about this conclusion.

ggplot(babies_psmoke,aes(x = inc))+
  geom_col(aes(y = abs(resid/psmoke)))

Plotted here is the “percent residual” for each of the data points when using the degree two fit. This shows that in general the model is within a 10% error, but has an error above 15% in one case. This type of error would suggest to me a very poor fit of the data. The conclusion to be drawn from this is that there may be a higher likelihood for middle class income people to smoke; however, more research is needed to solidify that conclusion.

Chris

Subquestion

What is the relationship between gestation days and whether the mother smokes? Does smoking lead to premature birth?

Why it is important?

Right now, most people think that smoking during pregnancy makes a premature birth more likely, so I want to find out whether this is true and calculate the probabilities.

New tool I used

First I used the select function to keep the variables I need, gestation and smoke. Then I used the filter function together with count to count the rows in each group, which is something we learned in Lab 10 or Lab 11. Finally, I divided the counts to get the proportions I was looking for.

babies1<-select(babies, gestation, smoke)
gg270<-babies1 %>% filter(gestation>=270)%>% count()
gl270<-babies1 %>% filter(gestation<270)%>% count()

gw<-babies1 %>% filter(smoke=="1")%>% count()
gwo<-babies1 %>% filter(smoke=="0")%>% count()

ggwo270<-babies1 %>% filter(gestation>=270 & smoke=="0")%>% count()
glwo270<-babies1 %>% filter(gestation<270 & smoke=="0")%>% count()

ggw270<-babies1 %>% filter(gestation>=270 & smoke=="1")%>% count()
glw270<-babies1 %>% filter(gestation<270 & smoke=="1")%>% count()

These steps filter the data and count the rows I need later. gg270 is the number of births with gestation >= 270 days, and gl270 is the number with gestation < 270 days. gw is the number of mothers who smoke, and gwo is the number who do not. ggw270 and glw270 count smokers with gestation >= 270 and < 270 days, and ggwo270 and glwo270 count the same for non-smokers.

# gestation >= 270 days
ggw270/gg270 # smokers with gestation >= 270 / all with gestation >= 270
##          n
## 1 0.453125
ggwo270/gg270  # non-smokers with gestation >= 270 / all with gestation >= 270
##           n
## 1 0.5372596
# gestation < 270 days
glw270/gl270  # smokers with gestation < 270 / all with gestation < 270
##           n
## 1 0.5255102
glwo270/gl270  # non-smokers with gestation < 270 / all with gestation < 270
##           n
## 1 0.4642857
ggw270/gw # smokers with gestation >= 270 / all smokers
##           n
## 1 0.7789256
glwo270/gwo # non-smokers with gestation < 270 / all non-smokers
##           n
## 1 0.1672794

In the last two steps I divided by the group totals. About 77.9% of smokers had a gestation of at least 270 days. The last ratio uses glwo270, so 16.7% is the share of non-smokers with a gestation under 270 days, which means roughly 83% of non-smokers reached 270 days (ignoring missing gestation values). Smokers are therefore slightly less likely, not more likely, to reach 270 days, and the gap is small.
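Ratios like the ones above are conditional proportions, and the choice of denominator decides what each one means. A contingency table makes that choice explicit. This is a minimal sketch on simulated data: the full-term rates of 78% and 83% and the sample size are assumptions for illustration, not results from the babies data.

```r
# Simulated smoking status and full-term indicator (hypothetical values).
set.seed(6)
n         <- 1000
smoke     <- rbinom(n, 1, 0.47)
# full_term = 1 stands for gestation >= 270 days.
full_term <- rbinom(n, 1, ifelse(smoke == 1, 0.78, 0.83))

tab <- table(smoke = smoke, full_term = full_term)

# Rows sum to 1: proportion full-term within each smoking group.
row_props <- prop.table(tab, margin = 1)
# Columns sum to 1: proportion of smokers within each gestation group.
col_props <- prop.table(tab, margin = 2)
row_props
col_props

# Two-sample test of equal full-term proportions for smokers and non-smokers.
test <- prop.test(x = tab[, "1"], n = rowSums(tab))
test$p.value
```

The row proportions answer "how likely is a smoker to reach 270 days", while the column proportions answer "what share of the full-term births come from smokers"; only the first compares the two groups directly.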

Contributions

Lauren

I investigated my question using linear models, residuals, qq-plots, and the idea of splines.

Katie

I outlined Lab 14 and wrote the overall questions and the importance section. Then I modified my original individual question and answered it by creating a model of gestation days and the baby’s weight. In the model, I created a data_grid of predicted values and put it together with the real data to observe the residuals and the predictions.

Chris

I did my individual part on time and analyzed the subquestion I posed: whether smoking influences the length of the pregnancy. I used the filter function to count the rows in each group, then divided the counts to find the proportions I needed. Finally, I found that the results look the same as what I found in Lab 4: smoking does not have much influence on gestation days.

Ryan

I used a while loop and some modeling tools to further explore the relationship between income group and whether or not a person is a smoker.